Toward the Quantitative Analysis of 'Deviant' Articulation

Timothy J. Baehr
University of Michigan 

The evaluation of 'deviant' articulation (e.g. that of young children, speech-defective
persons, aphasics, second-language learners) has usually consisted of two activities:
transcription of the speech being evaluated, and comparison of the transcription against
some 'standard' set of 'target' sounds.

Any transcription is a description of a speaker's articulation in terms of the auditory
and perceptual capacities of a transcriber. Obviously, a poorly trained or inattentive
transcriber will produce false or misleading descriptions of speakers' articulation.
However, even well-trained linguists must have different hearing sensitivities and
different criteria for assigning various symbols to speech sounds.

The aim of this paper is to present a method for the transcription and quantitative
analysis of deviant articulation. The perceptual abilities of the transcriber are accounted
for and made an integral part of the method.

Two approaches to the evaluation of articulation correspondences 

The simplest (or at least the most widespread) strategy in making a transcription and
subsequent comparison of deviant articulation against some standard is to express both
the speaker's output and the standard in terms of a phonetic alphabet. A qualitative
comparison can be made, and percents correct can be calculated for each of the target
sounds. This 'alphabetic' approach has several shortcomings: (1) Nothing is said about
the articulatory positions or processes involved. (2) The process the transcriber goes
through in order to decide which of several hundred alphabetic symbols to select for a
given sound is not overtly indicated in his transcription. (3) Quantifications (such as
percent correct) have questionable meaningfulness.

Investigators rarely stop, however, at a simple alphabetic transcription. They
re-analyze, at least partly, their alphabetic data in terms of some phonetic attributes.
Re-analysis doesn't add anything new to the data; it reorganizes it into a more
interesting or more revealing form. Thus, an 'analphabetic' approach is usually a
tautological, a posteriori re-analysis of an alphabetic transcription.

On the other hand, there should be no theoretical obstacle to starting a priori with
a suitable analphabetic transcription method based on articulatory parameters. The
analphabetic method offers an experimenter the opportunity to determine how successfully
transcribers can use articulatory parameters as perceptual categories in making an
analphabetic transcription.

Analphabetic classification system 

General principles: Speech is segmentable into discrete entities which may be called
'phones'. Just how segmentation is to take place is a problem which has plagued linguists
and designers of speech-recognition devices for years. Although almost any linguist has
little trouble segmenting the speech he wishes to transcribe and analyze, the process of
segmentation has eluded exact specification. Some elucidation of the processes and
principles of segmentation may be found in Pike (1943). For the purposes of this paper, it
will be assumed that segmentation is a perceptual process distinct from description, and
that all transcribers have equally good skills of segmentation.

Since no two phones ever uttered will be exactly alike, it would in practice be
impossible to symbolize phones unless the symbols constituted a nearly infinite set. On
the other hand, phones may be classified together in a finite set of 'phone types', each
phone type being assigned a different symbol. The classification of phones into phone
types may be considered the basis of alphabetic transcription.

A phone type, then, is defined as the simultaneous intersection of a segment and the
terms of a classificational system. The classificational terms constitute the
analphabetic component of transcription. For the purpose of the present paper, it will be
convenient to specify some formal and practical requirements that the classificational
system and its terms must meet:

(1) The system must be composed of a finite set of categories. 






(2) The categories must be relatable to observables at at least one stage
(e.g. articulation, acoustic signal, audition, perception, decoding) of
the speech event.

(3) The categories must yield phonetic descriptions (either in terms of the
categories themselves or the phone types derived from the category
complexes) adequate for at least distinguishing dialects from one another.

(4) The categories must be uniform, that is, each category must have the same
degrees of membership or specificity.

(5) The categories may overlap somewhat, but not to the point of yielding an
ambiguous or phonetically redundant description of a given phone type.

It should be obvious that there is an intimate relationship between (1), (4), and (5).
The descriptive power of the categories and the resultant phone types is a function of the
number of categories, their degree of specificity, and the amount of overlap among them.

Only the first requirement is strictly formal. The others are necessary for the 
elegance and manageability of the analphabetic system. 

Articulatory categories: A set of phonemic categories that otherwise seems to fit the
above requirements fairly well is Jakobson's 'distinctive features' (Jakobson, Fant, and
Halle, 1963; Jakobson and Halle, 1956). Finer phonetic distinctions among phone types may
be obtained if the feature system can be expanded and modified somewhat. This expansion
and modification is discussed below.

Jakobson and his colleagues relate the feature system to various stages of the speech
event. Their terms are a combination of articulatory (e.g. 'Nasal'), acoustic ('Diffuse'),
and auditory ('Strident') terminology. However, since the transcriber would be making
judgments about the articulatory performance of speakers, such a combination of
terminologies would seem unnecessary and possibly disruptive, unless the transcriber were
already familiar with distinctive features and their articulatory correlates. Therefore,
it is assumed that the feature terminology may be easily translated into 'articulatory
categories'. Such a translation would maintain the uniformly binary nature of the
categories, and some of the theoretical reasons for binariness (Jakobson, Fant, and Halle,
1963, Chapter 1). The most salient point, however, especially in regard to requirement
(4) above, is the uniformity of this binariness across all categories.




A tentative set of articulatory categories is described briefly below, with deviations
from and additions to Jakobsonian terminology noted. Basic to much of the description
are Halle's (1964) postulated four degrees of narrowing in the vocal tract: Contact,
Occlusion, Obstruction, and Constriction. These four degrees of narrowing are
characterized by stops, fricatives, glides, and high vowels, respectively.

VOCALIC -- NON-VOCALIC. Vocalic sounds have a degree of narrowing not exceeding
constriction; Non-Vocalic sounds have a degree of narrowing that exceeds constriction.

CONSONANTAL -- NON-CONSONANTAL. Consonantal sounds have a degree of narrowing equal
to or exceeding occlusion; Non-Consonantal sounds have a degree of narrowing less than
occlusion.

INTERRUPTED -- CONTINUANT. Interrupted sounds have a degree of narrowing equal to
contact; Continuants have a degree of narrowing less than contact. Certain nasals (e.g. [m],
[n], etc.), because they open an alternate 'escape route' for the airstream (i.e. the
nasal passage), are described as Continuant.

EDGED -- NON-EDGED. Edged sounds involve the forcing of the airstream over a relatively
sharp edge, such as the teeth or uvula. In addition, Edged sounds must have a degree of
narrowing that exceeds constriction. The Jakobsonian term for this category is
Strident -- Mellow.

PERIPHERAL-1 -- NON-PERIPHERAL-1. Peripheral-1 sounds have a primary narrowing at
either of the oral peripheries, the lips or the velum; Non-Peripheral-1 sounds have their
primary narrowing elsewhere. The Jakobsonian term is Grave -- Non-Grave.

PERIPHERAL-2 -- NON-PERIPHERAL-2. Peripheral-2 sounds have a secondary narrowing at
one of the oral peripheries. Non-Peripheral-2 sounds either do not have a secondary
narrowing or have one located elsewhere. The Jakobsonian term is Flat -- Plain.

MEDIAL-1 -- NON-MEDIAL-1. Medial-1 sounds are articulated with a primary narrowing in
the middle of the vocal cavity; Non-Medial-1 sounds have their primary narrowing elsewhere.
The Jakobsonian term is Acute -- Non-Acute. This category generally applies only to the
vowels and glides; for the consonants and liquids, Peripheral-1 -- Non-Peripheral-1 usually
provides adequate descriptive specificity.

MEDIAL-2 -- NON-MEDIAL-2. Medial-2 sounds have a secondary narrowing at the palate;
Non-Medial-2 sounds either do not have a secondary narrowing or have one elsewhere. The
Jakobsonian term is Sharp -- Plain.

CLOSE -- NON-CLOSE. Close sounds must be articulated with the mandible closed or
nearly closed. Thus, the front consonants and liquids and the high vowels and glides
cannot be articulated if the mandible is opened beyond a certain point, without strenuous
compensation in the pharyngeal cavity. The Non-Close sounds do not have this restriction on
mandible-closure. The Jakobsonian term is Diffuse -- Non-Diffuse.

OPEN -- NON-OPEN. Open sounds are articulated such that the mandible may be opened to
its widest extent. No degree of narrowing equal to or exceeding constriction is possible.
Therefore, this term describes the open, or wide, vowels. Non-Open sounds cannot be
articulated with the mandible opened to the extent permitted among the Open sounds. The
Jakobsonian term is Compact -- Non-Compact.

NASAL -- NON-NASAL. Nasal sounds are those produced by directing part or all of the
airstream through the nasal cavity; Non-Nasal sounds are produced by directing all of the
airstream through the oral cavity.

VOICED -- VOICELESS. Voiced sounds are produced with vocal cord vibration; Voiceless
sounds are produced without vocal cord vibration.

TENSE -- LAX. In Tense sounds, the articulators spend a relatively longer time away
from a neutral 'rest' position than in Lax sounds. (An alternative definition has to do
with the relative amount of air pressure posterior to the point of narrowing. The writer
finds that definition less convincing.)

The next three categories are due to J. C. Catford (personal communication, 1965);
the writer, however, is responsible for their particular application.

EGRESSIVE -- INGRESSIVE. Egressive sounds are those in which the airstream flows out
of the vocal tract; Ingressive sounds are those in which the airstream flows into the
vocal tract.
PULMONIC -- NON-PULMONIC. In Pulmonic sounds, the airstream is initiated at the lungs
(by pressure of the diaphragm and/or the intercostal muscles); in Non-Pulmonic sounds,
the airstream is initiated elsewhere. The vowels are all Pulmonic.

GLOTTALIC -- VELARIC. Glottalic sounds are those which are initiated by compression
or expansion of the glottis; Velaric sounds are initiated at the velum. The closest
corresponding Jakobsonian term is Checked -- Unchecked, although its definition is somewhat
different. Sounds can be described as Glottalic or Velaric only if they are also
Non-Pulmonic.

Stress, tone, and length are not included in the set of articulatory categories. The
writer feels that these three aspects of speech are more properly evaluated as part of the
analysis of syllables, not segments. Moreover, stress, tone, and length can be made
binary only in the most artificial way.

It seems at this point that articulatory facts are being set up as perceptual
categories or features. In a sense, this is true; but since the transcriber shares the same
articulatory mechanisms with the speaker, it is reasonable for him to perceive in terms
of articulatory parameters. Moreover, if one considers the features to be merely names or
labels of corresponding auditory and perceptual phenomena, then the question of terminology
becomes trivial.

Phone types: A phone type is defined as the simultaneous intersection of a 'segment'
with the sixteen articulatory categories. Each segment-category intersection is marked in
one of three ways: with (1) a Plus or (2) a Minus, indicating whether the segment has one
or the other value of the binary category, or (3) a Zero, indicating an 'impossible'
intersection due to the way the categories have been defined. For example, the vowels
(marked Plus Vocalic and Minus Consonantal) cannot be also marked Plus or Minus Edged.
Likewise, Plus or Minus Open cannot be used to describe segments otherwise marked as Plus
Consonantal.
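As a rough illustration, a phone type can be held as a mapping from the sixteen categories
to one of the three marks. The sketch below assumes the category names listed above; the
function name `check_phone_type` and the sample vowel marking are hypothetical (the marking
is not taken from Table 1), and only the 'impossible' intersections actually stated in the
text are checked.

```python
# The sixteen articulatory categories, named by their 'Plus' pole.
CATEGORIES = (
    "Vocalic", "Consonantal", "Interrupted", "Edged",
    "Peripheral-1", "Peripheral-2", "Medial-1", "Medial-2",
    "Close", "Open", "Nasal", "Voiced", "Tense",
    "Egressive", "Pulmonic", "Glottalic",
)
VALID_MARKS = {"+", "-", "0"}


def check_phone_type(marks):
    """Return a list of problems found in a category -> mark mapping."""
    problems = []
    for category in CATEGORIES:
        if marks.get(category) not in VALID_MARKS:
            problems.append(f"{category}: missing or invalid mark")
    # Vowels (Plus Vocalic, Minus Consonantal) cannot be Plus or Minus Edged.
    is_vowel = marks.get("Vocalic") == "+" and marks.get("Consonantal") == "-"
    if is_vowel and marks.get("Edged") != "0":
        problems.append("Edged must be 0 for vowels")
    # Plus or Minus Open cannot describe Plus Consonantal segments.
    if marks.get("Consonantal") == "+" and marks.get("Open") != "0":
        problems.append("Open must be 0 for Plus Consonantal segments")
    # Glottalic/Velaric applies only to Non-Pulmonic sounds.
    if marks.get("Pulmonic") == "+" and marks.get("Glottalic") != "0":
        problems.append("Glottalic must be 0 for Pulmonic sounds")
    return problems


# Hypothetical marking for some vowel, for illustration only.
example_vowel = {category: "-" for category in CATEGORIES}
example_vowel.update({
    "Vocalic": "+", "Voiced": "+", "Egressive": "+", "Pulmonic": "+",
    "Edged": "0", "Glottalic": "0",
})
print(check_phone_type(example_vowel))
```

A mapping keyed by category name keeps the Zero intersections explicit, so they can be
skipped when Plus and Minus marks are counted later.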

The Standard: The standard articulations against which a given deviant speaker is
to be compared are presented as a chart. Across the top of the chart is an inventory of
phone types of the speech community being used as the standard of comparison. Down
the left side of the chart are the articulatory categories. At the intersection of a given
category with a given phone type a 'plus' or 'minus' indicates whether the phone type has
one or the other attribute of the category. Impossible intersections are marked with a
'zero'. The impossible intersections are not language-specific; they are due to the way
the categories are defined.

An illustration of what a Standard should look like is given in Table 1. The Standard
is of the writer's dialect of Midwest American.



Insert Table 1 about here 



Sometimes it may be desirable to introduce further zeroes into the Standard. This
would be the case if there were a range of phone types that were acceptable variant
articulations in a given test item. For instance, if it made no difference whether
the speaker pronounced 'frog' as [frag] or [frog], then the vowel nucleus would be marked
0 Peripheral-2 and 0 Tense in the Standard.

Collection of speech samples from deviant speakers 

The specific procedures for collecting speech samples will vary depending on the
speaker or speakers to be tested. However, there are some general criteria that should be
observed:

(1) If at all possible, the speech sample should be elicited in the form of
single-word isolates, that is, preceded and followed by a pause. This is
to reduce the possibly unpredictable effect of syntactic, morphological
and prosodic environment on articulation. As an alternative, the speech
may be elicited as part of a constant verbal framework. This criterion
need not apply, of course, if the verbal framework is manipulated as an
independent variable.

(2) Unless imitative ability is being evaluated, the speech sample should not
be obtained by imitation. Instead, the speaker may be asked to name certain
items or actions, answer leading questions (such as "What is the
opposite of hot?"), etc. Imitation would, of course, allow the use of
nonsense syllables if this were required in the experimental design.

(3) Each word (or nonsense syllable) elicited from the speaker must be
identified; otherwise the transcriber will have no idea what the target sounds
were. One of the easiest ways to obtain identification is to have a
specified order of presentation of the stimuli. An even safer strategy is
to include the identifications on the sample tape recordings.

(4) The sample size should be large enough to allow for statistical analysis
of the transcription(s). There can be no rigid rule for sample size, but
the sample should probably include not less than 75-100 words (containing
375-500 elicited phones).

(5) The sample should be recorded on the best recording equipment available. 

Transcription method 

Straight transcription: Two tape recorders are used. One plays the tape recording
obtained from the speaker; the other is fitted with a recursive tape loop of about eight
seconds' duration at 7 1/2 i.p.s. The tape machine plays into the loop machine, which is
on 'Record' mode. The transcriber listens to the tape through earphones. When he hears
a word spoken by the speaker, he switches the tape machine to 'Stop' and the loop machine
to 'Play'. This allows the transcriber to listen to the utterance repeatedly without
continually back-spacing the main tape. Mode-switching on the tape recorders is by remote
control. See Figure 1.

Insert Figure 1 about here 

The transcriber uses a transcription form similar to the one in Figure 2. As he
listens to an utterance, the transcriber marks each segment-category intersection with
a plus or minus to indicate what he has heard. Thus, for each speech sound the
transcriber hears, he has to make sixteen decisions about the parameters of the sound.
Impossible or other Zero intersections (as described above) are either left blank or marked
with a Zero.



Insert Figure 2 about here 

Differential transcription: Differential transcription differs from straight
transcription in that the transcriber marks on the transcription form only those
segment-category intersections which were 'missed' by the speaker, vis-à-vis the Standard.
Although the transcriber still must make sixteen decisions about each speech sound, the
tedium of having to mark every intersection on the form is eliminated. The 'correct'
intersections can be recovered later, if necessary, by referring back to what the target
sounds were in terms of the Standard.
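The recovery step can be sketched in a few lines. This is only an illustration under the
assumption that a 'missed' mark on a binary category means the transcriber heard the
opposite of the Standard's value; the function name `expand_differential` and the sample
data are hypothetical.

```python
def expand_differential(standard, missed):
    """Rebuild a full transcription of one segment from a differential one.

    standard: category -> '+', '-' or '0' for the target sound.
    missed:   set of categories the transcriber marked as missed.
    """
    flip = {"+": "-", "-": "+"}
    full = {}
    for category, mark in standard.items():
        if category in missed and mark in flip:
            full[category] = flip[mark]  # heard the opposite value
        else:
            full[category] = mark        # agreement, or a Zero intersection
    return full


# Hypothetical target sound described on three categories only.
target = {"Voiced": "+", "Nasal": "-", "Edged": "0"}
print(expand_differential(target, {"Voiced"}))
```

Zero intersections are passed through unchanged, since no decision is recorded for them.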

Training: Training consists of two steps: (1) Familiarization with the articulatory
categories and their definitions, and (2) Transcription of a tape of stimulus-words or
nonsense syllables in which the articulatory parameters are known and controlled by the
speaker. After each stimulus, the correct transcription is given and the trainee's
mistakes are discussed.



Perception and decision processes in transcription

A transcription method that accounts for the perceptual abilities of the transcriber
must include an explicit statement of its underlying theory of perception. The material
in this and the following section is based almost entirely on the Theory of Signal
Detectability (Tanner and Birdsall, 1958; Clarke, Birdsall, and Tanner, 1959; Swets, Tanner
and Birdsall, 1961; and others, cf. Swets, 1964).

A 'Signal' will be defined as that part of the speaker's acoustic output which carries
information about the articulatory positions and processes used in speaking. This
acoustic output need not be distinct from or independent of that which carries other
information such as rhythm, stress, syntax, affective state of the speaker, etc. An
'Ideal Signal' is defined as that which is produced by a normal speaker using his
native language or a trained speaker producing nonsense syllables (cf. a previous paragraph
on Training). A 'Non-Ideal Signal' is that which is produced by a (real or suspected)
deviant speaker.

For each target sound attempted (by either a normal or deviant speaker) there are
about sixteen 'intentions' attributable to the speaker concerning the articulatory
components of the sound. These sixteen intentions correspond to the articulatory categories
which simultaneously intersect the target sound segment (no intentions are attributable
for Impossible or Zero intersections). Clearly, intention is a hypothetical construct and
does not imply that the speaker is usually aware of manipulating the various articulatory
parameters.
Because of the way the articulatory categories are defined, each intention is binary.
However, the physical realization of the intention is not binary. For example, if in a
given sample of speech, 100 sounds were intended to be voiceless, they would not all be
absolutely voiceless in their realization. Because of other factors (articulatory
positions, prosody, etc.), some intended voiceless sounds may be very slightly voiced. The
distribution of the 100 intended voiceless sounds might look like Fig. 3.



In the same sample of speech, 100 sounds may have been intended to be voiced. Again, 
the physical realization of this intention will exhibit a certain distribution along the 
degree of voicing continuum. This distribution of intended voiced sounds may be plotted 
on the same coordinates as the intended voiceless sounds, as in Fig. 4. 



It will be noticed that in Fig. 4 the two distributions overlap slightly. This
overlap, to a greater or lesser extent, is to be expected for all the articulatory
categories. Unless the speaker is some sort of perfect automaton, the realizations of the
binary intentions attributed to him will always exhibit a certain degree of variability.

Perception: The transcriber's task may be very crudely characterized as an attempt
to determine what the speaker's intentions were for each articulatory category for each
speech sound segment. Because the distributions for a given category overlap, even a
'perfect' transcriber will not be able to give a totally accurate accounting of the
speaker's intentions. In fact, his error will be directly related to the amount of
overlap. Thus, the 'perfect' transcriber is neither perfect nor a transcriber but an ideal
mathematical device which can utilize all the information contained in the distributions
produced by the speaker. Henceforth in this paper, the mathematical device will be called
an 'Ideal Observer'; a 'live' transcriber who actually makes transcriptions will be called
a 'Non-Ideal Observer'.




Insert Figure 3 about here 



Insert Figure 4 about here 
The following is a description of how a Non-Ideal Observer processes the Signal
(Ideal or Non-Ideal) in order to make a transcription.

For each segment, the transcriber receives a complex of sensory data, which will be
symbolized X. This complex, X, will consist of information regarding the physical (i.e.
acoustic) realization of the speaker's intentions. The information may be expressed in
terms of the articulatory categories; therefore, $x_{ij}$ is a single datum, where i is a
given segment and j is a particular category. Each segment received by the transcriber
is then defined as the set:

$$X_i = \{x_{i1}, x_{i2}, \ldots, x_{ij}, \ldots, x_{in}\}$$    (1)



Associated with each segment-category intersection $x_{ij}$ are two probabilities or
likelihoods. One is the likelihood that $x_{ij}$ arose from the speaker's intention to give
a 'Plus' value to the category j; the other is the likelihood that $x_{ij}$ arose from the
speaker's intention to give a 'Minus' value to the category. (Henceforth, only one
category will be considered. The process, of course, is repeated for all the categories.
'Zero' intersections have no relevance to the process, since there is no intention
attributable to the speaker.)

All the information relevant to the transcriber for describing the speaker's
intention at a segment-category intersection may be expressed as a single-number likelihood
ratio (that is, the ratio of the two likelihoods discussed above):

$$\Lambda(x_{ij}) = \frac{f(x_{ij} \mid \text{'Plus'})}{f(x_{ij} \mid \text{'Minus'})}$$    (2)



If many segments are described in terms of a given category, the likelihood ratio can be
plotted on a one-dimensional axis. Any monotonic transformation of likelihood ratio will
be equally useful; the natural logarithm of likelihood ratio leads to convenient
statistics (Tanner and Birdsall, 1958). $\log_e \Lambda(x_{ij})$ is plotted on the abscissa
in Fig. 5. The ordinate is the probability density of $\log_e \Lambda(x_{ij})$. The
left-hand distribution in Fig. 5 is conditional upon the 'Minus' intention; the right-hand
distribution is conditional upon the 'Plus' intention. The two distributions are assumed
to be normal and to have equal variance.



Insert Figure 5 about here 

Fig. 5 preserves intact all the information contained in Fig. 4. If the values of
Fig. 5 could be calculated, then the difference in the means of the distributions divided
by the standard deviation would yield a detectability index (d') of the speaker's use of
category j.
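Written out, the index just described takes the following form. The symbols $\mu_{+}$ and
$\mu_{-}$ (means of the 'Plus' and 'Minus' distributions of the log likelihood ratio) and
$\sigma$ (their common standard deviation) are introduced here for illustration only and
do not appear in the original.

```latex
d'_j = \frac{\mu_{+} - \mu_{-}}{\sigma}
```

With equal variances, d' is thus the separation of the two means measured in standard
deviation units.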

Only the mathematical device (Ideal Observer) can determine the exact value of this
d', so it will be subscripted: $d'_{IO}$. Since a Non-Ideal Observer can only estimate the
likelihoods and their ratio, his performance will be somewhat less accurate than the
Ideal Observer's. The Non-Ideal Observer's performance has the effect of increasing the
variance of the two distributions, thus depressing the value of the detectability index.
The d' expressing the Non-Ideal Observer's detection of the speaker's intentions in regard
to category j is also subscripted: $d'_{NIO}$.

In summary, the Non-Ideal Observer operates on a sensory datum, $x_{ij}$, and the
likelihoods of the intentions giving rise to it. His estimates of the likelihood ratio for
each $x_{ij}$ will be at variance with the likelihood ratios of an Ideal Observer operating
on the same data. The distributions in Fig. 6 express the performance of a Non-Ideal
Observer processing the same information found in Figs. 4 and 5.

Insert Figure 6 about here 

It is postulated that

$$d'_{NIO} < d'_{IO}$$    (3)

Decision: What does the transcriber do with the likelihood ratios he estimates?
Certainly he doesn't actually plot their distributions as in Fig. 6. It was claimed that
any particular sensory datum, $x_{ij}$, yields a single-number likelihood ratio
$\Lambda(x_{ij})$. This value (ignoring the logarithmic transformation for the moment) will
range from 0 to $+\infty$.



If $\Lambda(x_{ij})$ is very large, it is only reasonable to suppose that the transcriber
will decide that the speaker's intention was to make category j 'Plus' for the segment i.
Likewise, if $\Lambda(x_{ij})$ is very small, the transcriber will decide that the intention
was to make category j 'Minus' for segment i. This suggests that the transcriber may adopt
a particular value of $\Lambda(x_{ij})$, i.e. $\beta$, and establish a decision rule such
that he will respond

$$\begin{cases} \text{'Plus'} & \text{if } \Lambda(x_{ij}) \ge \beta \\ \text{'Minus'} & \text{if } \Lambda(x_{ij}) < \beta \end{cases}$$    (4)
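The decision rule can be illustrated with a small numerical sketch. It assumes, purely for
illustration, that the 'Plus' and 'Minus' densities are normal with equal variance and
with made-up means of +1 and -1; the function names `likelihood_ratio` and `respond` are
hypothetical.

```python
from statistics import NormalDist


def likelihood_ratio(x, plus, minus):
    """Ratio of the 'Plus' density to the 'Minus' density at datum x."""
    return plus.pdf(x) / minus.pdf(x)


def respond(x, plus, minus, beta=1.0):
    """Respond 'Plus' when the likelihood ratio reaches the criterion beta."""
    return "Plus" if likelihood_ratio(x, plus, minus) >= beta else "Minus"


# Illustrative distributions only; the parameter values are not from the paper.
plus = NormalDist(mu=1.0, sigma=1.0)
minus = NormalDist(mu=-1.0, sigma=1.0)

print(respond(2.0, plus, minus))              # neutral criterion
print(respond(2.0, plus, minus, beta=100.0))  # strong bias against 'Plus'
```

Raising the criterion changes the response to the same datum without changing the
distributions themselves, which is the sense in which bias is separate from detectability.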



The probability that the transcriber will respond 'Plus' when the speaker's intention
was indeed 'Plus' is the area, to the right of $\beta$, under the probability density curve
$f(x_{ij} \mid \text{'Plus'})$. The probability that the transcriber will respond 'Plus'
incorrectly, i.e. when the speaker's intention was indeed 'Minus', is the area, to the
right of $\beta$, under the probability density curve $f(x_{ij} \mid \text{'Minus'})$. These
two probabilities may be symbolized as follows:

$$p(R^{+} \mid \text{'Plus'}) = \int_{\beta}^{\infty} f(x_{ij} \mid \text{'Plus'}) \, dx_{ij}$$    (5)

$$p(R^{+} \mid \text{'Minus'}) = \int_{\beta}^{\infty} f(x_{ij} \mid \text{'Minus'}) \, dx_{ij}$$    (6)

where $R^{+}$ means a 'Plus' response by the transcriber.

Fig. 7 is identical to Fig. 6, except that $\beta$, $p(R^{+} \mid \text{'Plus'})$, and
$p(R^{+} \mid \text{'Minus'})$ are shown.



Insert Figure 7 about here 



If the two distributions are normal and have equal variance, then
$p(R^{+} \mid \text{'Plus'})$ and $p(R^{+} \mid \text{'Minus'})$ describe completely the
information contained in Fig. 7. (The other probabilities, $p(R^{-} \mid \text{'Plus'})$ and
$p(R^{-} \mid \text{'Minus'})$, are merely the complements of $p(R^{+} \mid \text{'Plus'})$
and $p(R^{+} \mid \text{'Minus'})$, respectively.) A table has been constructed (Elliott,
1959) to find the d' value from $p(R^{+} \mid \text{'Plus'})$ and
$p(R^{+} \mid \text{'Minus'})$. The two probabilities can be estimated from the
transcriber's actual transcription; this is discussed below.
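Where a printed table is not at hand, the same quantity can be approximated numerically.
The sketch below assumes the equal-variance normal model stated above, under which d' is
the difference between the standard normal deviates of the two 'Plus' response
probabilities; the function name `d_prime` is hypothetical, and Python's
`statistics.NormalDist` stands in for the table lookup.

```python
from statistics import NormalDist


def d_prime(p_plus_given_plus, p_plus_given_minus):
    """Equal-variance normal d' from the two 'Plus' response probabilities.

    Both probabilities must lie strictly between 0 and 1; inv_cdf raises
    StatisticsError otherwise.
    """
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(p_plus_given_plus) - z(p_plus_given_minus)


print(round(d_prime(0.99, 0.01), 2))
print(round(d_prime(0.80, 0.20), 2))
```

For probabilities of .99 and .01 this gives a value of roughly 4.65; small differences
from a printed table are to be expected from rounding.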
Independence of d' and $\beta$: It will be seen by inspection of Fig. 7 that $\beta$ can
assume any value and not affect d'. This is a crucial aspect of the Theory of Signal
Detectability. $\beta$ is a direct measure of the Observer's response bias due to prior
probabilities, affective state, payoffs, etc. In most theories of perception (notably
threshold theories), response bias is not accounted for separately from the perceptibility
score; it is merely another one of the factors contributing to the overall score. The
detectability index d' is, then, a direct measure of an Observer's ability to detect a
given Signal, regardless of his response bias.

Estimation of response probabilities: For each category j, a confusion matrix can
be constructed from the transcription. In order to construct the matrix, the actual
intentions of the speaker must be known or assumed to be known.

The intentions of the speaker producing an Ideal Signal are assumed to be accurately 
describable by the speaker himself. This implies that the speaker must be familiar with 
the system of articulatory categories, and must be able to exert conscious control over 
his speech parameters. Conceivably, this speaker could produce utterances from some sort 
of 'script' describing beforehand the articulatory positions and processes to be used. 

The intentions of the deviant speaker producing the Non-Ideal Signal are defined as 
the analphabetic phonetic description, in terms of the Standard, of the target sounds he 
attempts. 

In both cases, the intentions are described as the Plusses and Minuses occurring at
segment-category intersections.

The confusion matrix is constructed as follows: The row entries in the matrix are
the Plus and Minus values of the speaker's intentions for category j. The column
entries are the Plus and Minus values the transcriber indicates in making his transcription.
(Of course, in making a differential transcription of deviant articulation, the
transcriber makes no overt indication of the segment-category intersections in which he
thinks the speaker's production of the target sound agrees with the description of the
intersection for that target sound in the Standard.) In the cells of the confusion matrix
are tabulated the frequencies of agreement and disagreement between the actual intentions
(row entries) and the intentions observed by the transcriber. An example of a confusion
matrix for a given category is shown in Fig. 8.

Insert Figure 8 about here 

The labels in the cells stand for the following: (a) The speaker's intention was
'Plus' and the transcriber detected it as 'Plus', (b) The speaker's intention was 'Plus'
and the transcriber detected it as 'Minus', (c) The speaker's intention was 'Minus' and
the transcriber detected it as 'Plus', (d) The speaker's intention was 'Minus' and the
transcriber detected it as 'Minus'.

The sum a + b is the total number of segments for which the speaker intended a
'Plus' for category j. The sum c + d is the total number of segments for which the
speaker intended a 'Minus' for category j. The sum a + b + c + d is the total number
of segments (target sounds in the case of the deviant speaker) uttered by the speaker in
the sample.

The probabilities given in equations (5) and (6) may be estimated by the following
formulas:

$$p(R^{+} \mid \text{'Plus'}) = \frac{a}{a + b}$$    (5')

$$p(R^{+} \mid \text{'Minus'}) = \frac{c}{c + d}$$    (6')



The detectability index (d') for category j can be found by referring these two
probabilities to Elliott's (1959) table.
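The tallying behind these estimates can be sketched as follows. The function names
`confusion_counts` and `response_probabilities` and the sample marks are hypothetical; the
cell labels a, b, c, d follow the definitions given for the confusion matrix, and Zero
intersections are simply skipped.

```python
def confusion_counts(intended, observed):
    """Tally cells a, b, c, d for one category from paired marks."""
    cell = {("+", "+"): "a", ("+", "-"): "b", ("-", "+"): "c", ("-", "-"): "d"}
    counts = {"a": 0, "b": 0, "c": 0, "d": 0}
    for pair in zip(intended, observed):
        if pair in cell:  # pairs involving a Zero mark are ignored
            counts[cell[pair]] += 1
    return counts


def response_probabilities(counts):
    """Return the estimates a/(a+b) and c/(c+d).

    Raises ZeroDivisionError if no 'Plus' or no 'Minus' intentions occur.
    """
    a, b, c, d = (counts[key] for key in "abcd")
    return a / (a + b), c / (c + d)


# Made-up marks for one category over nine segments.
intended = list("++++----0")
observed = list("+++-+---0")
counts = confusion_counts(intended, observed)
print(counts, response_probabilities(counts))
```

The two estimates can then be referred to a d' table in the manner described above.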

Practical range of d': Theoretically, d' approaches infinity as
$p(R^{+} \mid \text{'Plus'})$ approaches 1 and $p(R^{+} \mid \text{'Minus'})$ approaches 0.
This approach to infinity is very gradual, however. In the most readily available table of
d' (Elliott, 1959), the highest d' is 4.64, when $p(R^{+} \mid \text{'Plus'}) = .99$ and
$p(R^{+} \mid \text{'Minus'}) = .01$. When $p(R^{+} \mid \text{'Plus'})$ and
$p(R^{+} \mid \text{'Minus'})$ are such that d' would exceed 4.64, it is safe to assign a
d' value of 4.90. Unless sample size (total number of segments for a given confusion
matrix) is very large (e.g. several thousand), even perfect detection can be assigned a d'
value of 4.90.



o 
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The difference between d* - 4.64 end d’ - 4,90 is roughly proportional to the dif- 
ference between any two adjacent d* vanes in the Elliott table. 
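
The ceiling rule described above can be sketched as follows, assuming the equal-variance normal model in place of the printed table (an assumption, not the paper's procedure). Because the exact normal value at .99 and .01 is about 4.65 rather than the tabled 4.64, the sketch takes its ceiling from the formula itself. The names `z_score` and `capped_d_prime` are hypothetical.

```python
import math
from statistics import NormalDist

ASSIGNED_MAX = 4.90  # value the paper assigns beyond the end of the table


def z_score(p):
    """Normal deviate of p, extended to plus or minus infinity at p = 1 or 0."""
    if p <= 0.0:
        return -math.inf
    if p >= 1.0:
        return math.inf
    return NormalDist().inv_cdf(p)


# Ceiling corresponding to the last table entry (.99 and .01) under the assumed model.
TABLE_CEILING = z_score(0.99) - z_score(0.01)


def capped_d_prime(p_plus, p_minus):
    """Return d', or 4.90 when d' would run past the end of the table."""
    d = z_score(p_plus) - z_score(p_minus)
    if math.isnan(d):
        # Both probabilities are 0, or both are 1: d' is undefined.
        raise ValueError("d' is undefined for these probabilities")
    return ASSIGNED_MAX if d > TABLE_CEILING else d


print(capped_d_prime(1.0, 0.0))  # perfect detection is assigned the ceiling value
```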



Use of d' in the evaluation of 'deviant' articulation



The Non-Ideal Signal: The Non-Ideal Signal can differ from the Ideal Signal in two
ways: (1) The means of the two distributions (cf. Fig. 4) may be closer together, and/or
(2) The variance of the distributions may be greater. It is therefore postulated that the
detectability of the Non-Ideal Signal is less than the detectability of the Ideal Signal:

$$ d'_{NIS} < d'_{IS} \tag{7} $$

given that the Observer is the same in both cases.

Maximum d' for the Non-Ideal Observer: The greatest d' value a 'live' transcriber
can achieve for a given category is for an Ideal Signal. This maximum d' can be determined
experimentally for any transcriber or group of transcribers.

Efficiency of the deviant speaker: This paper has defined two kinds of Signals and
two kinds of Observers. The relationship of these in a diagram analogous to a one-way
communication channel (after Tanner and Birdsall, 1958) is shown in Fig. 9.



Insert Figure 9 about here 



By referring to the positions of the 'switches' in Fig. 9, the cumbersome subscripts
used with the various d' values can be simplified. The two postulates, equations (3)
and (7), can be restated as follows:

    [equation illegible in source]    (3')

and

    [equation illegible in source]    (7')



With the same Non-Ideal Observer (transcriber) used for making transcriptions of both
Ideal Signals and Non-Ideal Signals, the efficiency of the speaker producing the Non-Ideal
Signal can be estimated by

$$ \eta_j = \frac{d'_{22j}}{d'_{12j}} \tag{8} $$

for each category j. Thus, the transcriber's d' for the deviant speaker (Non-Ideal
Signal) is weighted by his d' for the Ideal Signal.
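
A minimal sketch of the efficiency computation follows. It reads equation (8) as a simple ratio of the transcriber's two d' values for the same category. The function name `efficiency`, the category names and all d' values are hypothetical.

```python
def efficiency(d_nonideal_signal, d_ideal_signal):
    """Ratio of d' for the deviant speaker to d' for the Ideal Signal.

    Both values must come from the same transcriber and the same category.
    """
    if d_ideal_signal <= 0:
        raise ValueError("d' for the Ideal Signal must be positive")
    return d_nonideal_signal / d_ideal_signal


# Hypothetical d' values for one transcriber, keyed by category.
d_deviant = {"voicing": 2.1, "nasality": 3.0}
d_ideal = {"voicing": 4.2, "nasality": 4.0}

eta = {cat: efficiency(d_deviant[cat], d_ideal[cat]) for cat in d_deviant}
print(eta)
```

Under the postulate that the Non-Ideal Signal is less detectable than the Ideal Signal, each ratio would be expected to fall below 1.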

The efficiency score is dependent on both the Non-Ideal Signal and the transcriber.
Therefore, transcribers can be used interchangeably only if the efficiency of the
Non-Ideal Signal relative to the Ideal Signal (i.e. η_j) is the same (or nearly so) for
both transcribers. An equivalent requirement is that the efficiency of the transcribers
relative to each other must be the same for both signals. In other words, if

$$ \frac{d'_{22j,\,\text{Observer 1}}}{d'_{12j,\,\text{Observer 1}}} = \frac{d'_{22j,\,\text{Observer 2}}}{d'_{12j,\,\text{Observer 2}}} \tag{9} $$

then

$$ \frac{d'_{22j,\,\text{Observer 1}}}{d'_{22j,\,\text{Observer 2}}} = \frac{d'_{12j,\,\text{Observer 1}}}{d'_{12j,\,\text{Observer 2}}} \tag{10} $$
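
The interchangeability requirement can be checked numerically along the following lines. This is a sketch under stated assumptions: the tolerance of 5 percent for "nearly so" is an arbitrary choice, and the function names and d' values are hypothetical. Each observer is represented by a pair (d' for the Non-Ideal Signal, d' for the Ideal Signal) for one category.

```python
import math


def interchangeable(obs1, obs2, rel_tol=0.05):
    """Equation (9): the two observers yield (nearly) the same efficiency."""
    eta1 = obs1[0] / obs1[1]
    eta2 = obs2[0] / obs2[1]
    return math.isclose(eta1, eta2, rel_tol=rel_tol)


def observer_ratios(obs1, obs2):
    """Equation (10): observer-to-observer ratio for each of the two signals."""
    return obs1[0] / obs2[0], obs1[1] / obs2[1]


# Hypothetical values: Observer 2 is uniformly less sensitive than Observer 1.
observer_1 = (2.0, 4.0)
observer_2 = (1.5, 3.0)

print(interchangeable(observer_1, observer_2))
print(observer_ratios(observer_1, observer_2))
```

When the two efficiencies agree, the two ratios returned by `observer_ratios` agree as well, since each equality follows from the other by cross-multiplication.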



It should be noted that this requirement is quite different from one which would
require that every transcriber's detection index be the same given a particular category of
a particular Signal. The requirement stated in equations (9) and (10) means only that
the Non-Ideal Observers' relationship to each other be independent of the relationship
among the detectabilities (as would be determined by an Ideal Observer) of the Signals
involved.

Even if the required equalities of equations (9) and (10) are not met, the transcribers
may differ from each other in some regular way. In such a case, the η's of one
transcriber can be adjusted by some constant in order to achieve interchangeability.
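
One possible form of such an adjustment is sketched below. The paper does not say whether the constant is multiplicative or additive, nor how it should be estimated. The sketch assumes a multiplicative constant estimated as the mean ratio of the two transcribers' efficiencies over shared categories. All names and values are hypothetical.

```python
from statistics import mean


def adjustment_constant(eta_reference, eta_other):
    """Mean ratio of reference to other efficiencies over shared categories."""
    shared = eta_reference.keys() & eta_other.keys()
    if not shared:
        raise ValueError("the transcribers share no categories")
    return mean(eta_reference[cat] / eta_other[cat] for cat in shared)


def adjust(eta_other, constant):
    """Rescale one transcriber's efficiencies by the constant."""
    return {cat: constant * value for cat, value in eta_other.items()}


# Hypothetical efficiencies for two transcribers on the same speaker.
eta_transcriber_1 = {"voicing": 0.60, "nasality": 0.90}
eta_transcriber_2 = {"voicing": 0.50, "nasality": 0.75}

k = adjustment_constant(eta_transcriber_1, eta_transcriber_2)
print(adjust(eta_transcriber_2, k))
```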

Data from an ongoing pilot study indicate that the requirements of equations
(9) and (10) can be met, with no adjustment or correction factor needed.

Subsets: It may be useful and desirable to estimate η scores for subsets of target
sounds. For instance, it may be interesting to obtain five η's for the category
Voiced-Voiceless: (1) All sounds (in the sample); (2) Vowels only; (3) Consonants only;
(4) Stops; (5) Non-Nasal continuants. (In English, it may be found that η_voicing is
very high overall, but that this score is inflated by inclusion of the vowels.) It is
up to the investigator to decide what kinds of subsets to look at, and to justify them on
linguistic or practical grounds.

Of course, two d' scores must be obtained for each subset: one for the deviant
speaker and one for the Ideal Signal. Thus,

$$ \eta_{jk} = \frac{d'_{22jk}}{d'_{12jk}} \tag{11} $$

for each subset k of category j.
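
The subset computation can be sketched as follows, assuming that each transcribed segment is stored with its sound class, the speaker's intention and the transcriber's response for one category, and assuming the equal-variance normal model for d' in place of the printed table. The record layout, the function names and the counts are hypothetical.

```python
from statistics import NormalDist


def d_prime(pairs):
    """pairs: list of (intended_plus, detected_plus) booleans for one category."""
    a = sum(1 for intended, detected in pairs if intended and detected)
    b = sum(1 for intended, detected in pairs if intended and not detected)
    c = sum(1 for intended, detected in pairs if not intended and detected)
    d = sum(1 for intended, detected in pairs if not intended and not detected)
    p_plus, p_minus = a / (a + b), c / (c + d)
    z = NormalDist().inv_cdf  # raises an error if a probability is exactly 0 or 1
    return z(p_plus) - z(p_minus)


def select(records, subset):
    """Keep the (intended, detected) pairs whose sound class is in the subset."""
    return [(intended, detected) for cls, intended, detected in records if cls in subset]


def subset_efficiency(deviant, ideal, subset):
    """Efficiency for one subset: d' for the deviant speaker over d' for the Ideal Signal."""
    return d_prime(select(deviant, subset)) / d_prime(select(ideal, subset))


def make(cls, a, b, c, d):
    """Build hypothetical records from the four confusion-matrix counts."""
    return ([(cls, True, True)] * a + [(cls, True, False)] * b
            + [(cls, False, True)] * c + [(cls, False, False)] * d)


deviant = make("stop", 8, 2, 3, 7) + make("continuant", 9, 1, 1, 9)
ideal = make("stop", 9, 1, 1, 9) + make("continuant", 9, 1, 1, 9)

eta_stops = subset_efficiency(deviant, ideal, {"stop"})
eta_all = subset_efficiency(deviant, ideal, {"stop", "continuant"})
print(round(eta_stops, 2), round(eta_all, 2))
```

In this invented sample the overall score is higher than the score for stops alone, which is the kind of inflation by an easy subset that the text warns about.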

Discussion 

Limitations: The method of transcription and analysis presented in this paper is
intended to describe only a very small part of what may be termed deviant speech. The method
has value only when there are clearcut segments which may be compared, in one-to-one fashion,
with the segments of some standard or target. This limitation excludes from consideration
such interesting phenomena as omission, intrusion, metathesis, etc., at least when a
quantitative evaluation is needed.

Another limitation is the specificity or 'fineness' of phonetic description achievable
by a set of binary categories. The ad hoc addition of categories as they are needed
seems to be of doubtful value; eventually the system would be unwieldy and ridiculously
complex. The writer feels, however, that the category system as it stands approaches the
limit of reliable perceptibility by human observers. Investigations requiring finer
discriminations would perhaps be better handled by the various instrumental techniques
available.

Applications: The differential transcription method and its statistical analysis
are being developed for investigations of the speech of very young children. The method
is intended to fill a need for quantitative evaluation of articulation so that a
longitudinal plot can be made of a child's development of speech sound specificity (Sharf,
Baehr, and Fleming, 1967).

The method should also prove to be a useful tool in other investigations in which
speech samples are to be compared, either against each other or against a standard. This
would include studies of dialect, speech disorders, etc.



With appropriate modifications, the method may be useful as a clinical tool in both
evaluation and therapy planning.

Finally, the method may be of considerable pedagogical use for evaluating and training
second-language learners.

Summary 

A method for the quantitative analysis of deviant articulation has been proposed.
The method is based on (1) analphabetic transcription using binary articulatory categories,
and (2) analysis of the transcription in terms of the perceptual performance of the
transcriber, as measured by the Theory of Signal Detectability.

Fig. 1. Schematic of transcription equipment.

Fig. 2. Transcription form with transcription of the word 'phonetic'.

Fig. 4. Hypothetical distribution of voiced and voiceless sounds.

Fig. 8. Confusion matrix summarizing transcriber's responses in regard to speaker's
intentions for a given category.




